(54) Software scheduled superscalar computer architecture



(57) A computing system is described in which groups of
individual instructions are executed in parallel by processing
pipelines, and instructions to be executed in parallel by
different pipelines are supplied to the pipelines
simultaneously. During compilation of the instructions those
which can be executed in parallel are identified. The system
includes a register for storing an arbitrary number of the
instructions to be executed. The instructions to be executed
are tagged with pipeline identification tags and group
identification tags indicative of the pipeline to which they
should be dispatched, and the group of instructions which may
be dispatched during the same operation. The pipeline and
group identification tags are used to dispatch the appropriate
groups of instructions simultaneously to the differing
pipelines.
Description

[0001] This invention relates to a method according 
to claim 1; a processor according to claim 15 and an 
apparatus with a processor according to claim 28, i.e. 
to the architecture of computing systems, and in partic- 
ular to an architecture in which groups of instructions 
may be executed in parallel, as well as to methods and 
apparatus for accomplishing that. 
[0002] A common goal in the design of computer ar- 
chitectures is to increase the speed of execution of a 
given set of instructions. Many solutions have been pro- 
posed for this problem, and these solutions generally 
can be divided into two groups. 
[0003] According to a first approach, the speed of ex- 
ecution of individual instructions is increased by using 
techniques directed to decreasing the time required to 
execute a group of instructions serially. Such techniques 
include employing simple fixed-width instructions, pipe- 
lined execution units, separate instruction and data 
caches, increasing the clock rate of the instruction proc- 
essor, employing a reduced set of instructions, using 
branch prediction techniques, and the like. As a result it 
is now possible to reduce the number of clocks to exe- 
cute an instruction to approximately one. Thus, in these 
approaches, the instruction execution rate is limited to 
the clock speed for the system. 
[0004] To push the limits of instruction execution to 
higher levels, a second approach is to issue more than 
one instruction per clock cycle, in other words, to issue 
instructions in parallel. This allows the instruction exe- 
cution rate to exceed the clock rate. There are two clas- 
sical approaches to parallel execution of instructions. 
[0005] Computing systems that fetch and examine several
instructions simultaneously, to find parallelism in existing
instruction streams and determine whether any can be issued
together, are known as superscalar computing systems. In a
conventional superscalar system, a small number of independent
instructions are issued in each clock cycle. Techniques are
provided, however, to prevent more than one instruction from
issuing if the instructions fetched are dependent upon each
other or do not meet other special criteria. There is a high
hardware overhead associated with this hardware instruction
scheduling process. Typical superscalar machines include the
Intel i960CA, the IBM RIOS, the Intergraph Clipper C400, the
Motorola 88110, the Sun SuperSparc, the Hewlett-Packard
PA-RISC 7100, the DEC Alpha, and the Intel Pentium.

[0006] Many researchers have proposed techniques for
superscalar multiple instruction issue. Agerwala, T., and
J. Cocke [1987], "High Performance Reduced Instruction Set
Processors," IBM Tech. Rep. (March), proposed this approach
and coined the name "superscalar." IBM described a computing
system based on these ideas, and now manufactures and sells
that machine as the RS/6000 system. This system is capable of
issuing up to four instructions per clock and is described in
"The IBM RISC System/6000 Processor," IBM J. of Res. &
Develop. (January 1990) 34:1.
[0007] The other classical approach to parallel instruction
execution is to employ a "wide-word" or "very long instruction
word" (VLIW) architecture. A VLIW machine requires a new
instruction set architecture with a wide-word format. A VLIW
format instruction is a long fixed-width instruction that
encodes multiple concurrent operations. VLIW systems use
multiple independent functional units. Instead of issuing
multiple independent instructions to the units, a VLIW system
combines the multiple operations into one very long
instruction. For example, in a VLIW system, multiple integer
operations, floating point operations, and memory references
may be combined in a single "instruction." Each VLIW
instruction thus includes a set of fields, each of which is
interpreted and supplied to an appropriate functional unit.
Although the wide-word instructions are fetched and executed
sequentially, because each word controls the entire breadth of
the parallel execution hardware, highly parallel operation
results. Wide-word machines have the advantage of scheduling
parallel operation statically, when the instructions are
compiled. The fixed width instruction word and its parallel
hardware, however, are designed to fit the maximum parallelism
that might be available in the code, and most of the time far
less parallelism is available in the code. Thus for much of
the execution time, most of the instruction bandwidth and the
instruction memory are unused.

[0008] There is often a very limited amount of parallelism
available in a randomly chosen sequence of instructions,
especially if the functional units are pipelined. When the
units are pipelined, operations being issued on a given clock
cycle cannot depend upon the outcome of any of the previously
issued operations already in the pipeline. Thus, to
efficiently employ VLIW, many more parallel operations are
required than the number of functional units.

[0009] Another disadvantage of VLIW architectures, which
results from the fixed number of slots in the very long
instruction word for classes of instructions, is that a
typical VLIW instruction will contain information in only a
few of its fields. This is inefficient, requiring the system
to be designed for a circumstance that occurs only rarely: a
fully populated instruction word.

[0010] Another disadvantage of VLIW systems is the need to
increase the amount of code. Whenever an instruction is not
full, the unused functional units translate to wasted bits,
no-ops, in the instruction coding. Thus useful memory and/or
instruction cache space is filled with useless no-op
instructions. In short, VLIW machines tend to be wasteful of
memory space and memory bandwidth except for only a very
limited class of programs.
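
The cost described in the two preceding paragraphs can be made concrete with a small sketch. It assumes a hypothetical machine with eight functional-unit slots per VLIW word and an invented list of group sizes (the number of operations found to issue together in each cycle); neither figure is taken from this description.

```python
def vliw_slot_usage(group_sizes, slots_per_word):
    """Return (useful_slots, noop_slots) for a fixed-width VLIW encoding.

    Each entry of group_sizes is the number of operations that can issue
    together in one cycle. Every cycle costs one full instruction word,
    so the slots that carry no operation are encoded as no-ops.
    """
    for size in group_sizes:
        if not 0 < size <= slots_per_word:
            raise ValueError("group does not fit in one instruction word")
    useful = sum(group_sizes)
    total = len(group_sizes) * slots_per_word
    return useful, total - useful


# Hypothetical workload: five cycles, mostly narrow groups.
useful, noops = vliw_slot_usage([2, 1, 3, 1, 8], slots_per_word=8)
print(useful, noops)  # 15 useful slots, 25 no-op slots
```

Under these assumed numbers only one word in five is fully populated, and more than half of the encoded slots are no-ops.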

[0011] The term VLIW was coined by J. A. Fisher and his
colleagues in Fisher, J. A., J. R. Ellis, J. C. Ruttenberg,
and A. Nicolau [1984], "Parallel Processing: A Smart Compiler
and a Dumb Machine," Proc. SIGPLAN Conf. on Compiler
Construction (June), Palo Alto, CA, 11-16. Such a machine was
commercialized by Multiflow Corporation.

[0012] For a more detailed description of both superscalar and
VLIW architectures, see Computer Architecture: A Quantitative
Approach, John L. Hennessy and David A. Patterson, Morgan
Kaufmann Publishers, 1990.

[0013] The invention is defined in claims 1, 15 and 28.
[0014] We have developed a computing system architecture,
which we term software-scheduled superscalar, which enables
instructions to be executed both sequentially and in parallel,
yet without wasting space in the instruction cache or
registers. Like a wide-word machine, we provide for static
scheduling of concurrent operations at program compilation.
Instructions are also stored and loaded into fixed width
frames (equal to the width of a cache line). Like a
superscalar machine, however, we employ a traditional
instruction set, in which each instruction encodes only one
basic operation (load, store, etc.). We achieve concurrence by
fetching and dispatching "groups" of simple individual
instructions, arranged in any order. The architecture of our
invention relies upon the compiler to assign instruction
sequence codes to individual instructions at the time they are
compiled. During execution these instruction sequence codes
are used to sort the instructions into appropriate groups and
execute them in the desired order. Thus our architecture does
not suffer the high hardware overhead and runtime constraints
of the superscalar strategy, nor does it suffer the wasted
instruction bandwidth and memory typical of VLIW systems.
[0015] Our system includes a mechanism, an associ- 
ative crossbar, which routes in parallel each instruction 
in an arbitrarily selected group to an appropriate pipe- 
line, as determined by a pipeline tag applied to that in- 
struction during compilation. Preferably, the pipeline tag 
will correspond to the type of functional unit required for 
execution of that instruction, e.g., floating point unit 1. 
All instructions in a selected group can be dispatched 
simultaneously. 
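
The routing behaviour of the associative crossbar can be sketched in software as follows. This is only an illustration under stated assumptions: the `Tagged` record, the `route` function, the eight-pipeline machine and the rule that a pipeline accepts one instruction per group are all hypothetical and are not the hardware described here.

```python
from typing import Dict, List, NamedTuple


class Tagged(NamedTuple):
    opcode: str    # the simple instruction, e.g. "load"
    pipeline: int  # pipeline tag assigned at compile time


def route(group: List[Tagged], num_pipelines: int) -> Dict[int, str]:
    """Route every instruction of one group to the pipeline named by its tag.

    The position of an instruction within the group is irrelevant; only
    the tag decides where it goes, which is what makes the crossbar
    "associative" in this sketch.
    """
    outputs: Dict[int, str] = {}
    for instr in group:
        if not 0 <= instr.pipeline < num_pipelines:
            raise ValueError(f"no such pipeline: {instr.pipeline}")
        if instr.pipeline in outputs:
            raise ValueError(f"pipeline {instr.pipeline} requested twice")
        outputs[instr.pipeline] = instr.opcode
    return outputs


group = [Tagged("fmul", 5), Tagged("load", 0), Tagged("add", 2)]
print(route(group, num_pipelines=8))
```

Because the mapping is keyed by tag, reordering the instructions inside `group` does not change which pipeline receives each one.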

[0016] Thus, in one implementation, our system in- 
cludes a cache line, register, or other means for holding 
at least one group of instructions to be executed in par- 
allel, each instruction in the group having associated 
therewith a pipeline identifier indicative of the pipeline 
for executing that instruction and a group identifier in- 
dicative of the group of instructions to be executed in 
parallel. The group identifier causes all instructions hav- 
ing the same group identifier to be executed simultane- 
ously, while the pipeline identifier causes individual in- 
structions in the group to be supplied to an appropriate 
pipeline. 
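
How the two identifiers cooperate across successive cycles can be sketched as below. The tuple layout, the function name `dispatch_frame` and the rule that the next group to issue is the one at the front of the remaining instructions are assumptions made for the illustration.

```python
from typing import List, Tuple

# (group_id, pipeline_id, opcode); all three fields are illustrative.
Instr = Tuple[int, int, str]


def dispatch_frame(frame: List[Instr]) -> List[List[Tuple[int, str]]]:
    """Issue a frame one group per cycle.

    In each cycle every instruction whose group identifier matches the
    current group is selected, and its pipeline identifier names the
    pipeline that receives it.
    """
    cycles = []
    remaining = list(frame)
    while remaining:
        current = remaining[0][0]  # group id of the next group to issue
        issued = [(p, op) for g, p, op in remaining if g == current]
        remaining = [i for i in remaining if i[0] != current]
        cycles.append(sorted(issued))
    return cycles


frame = [(0, 1, "add"), (0, 4, "fadd"), (1, 0, "load"),
         (2, 1, "sub"), (2, 0, "store")]
for cycle, issued in enumerate(dispatch_frame(frame)):
    print(cycle, issued)
```

Each inner list holds the (pipeline, opcode) pairs issued together in one cycle, so a frame of five instructions in three groups takes three cycles in this sketch.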

[0017] In another embodiment the register holds multiple
groups of instructions, and all of the instructions in each
group having a common group identifier are placed next to each
other, with the group of instructions to be executed first
placed at one end of the register, and the instructions in the
group to be executed last placed at the other end of the
register.
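
The layout rule of this embodiment can be sketched as a packing step plus a check of the adjacency property. The names `pack_frame` and `groups_are_adjacent`, the use of list index 0 as the "first" end, and the frame width of eight are hypothetical choices for the example.

```python
from itertools import groupby
from typing import List, Tuple

Slot = Tuple[int, str]  # (group_id, opcode)


def pack_frame(groups: List[List[str]], width: int) -> List[Slot]:
    """Lay groups out contiguously, first-to-issue group at index 0."""
    frame = [(gid, op) for gid, ops in enumerate(groups) for op in ops]
    if len(frame) > width:
        raise ValueError("instructions do not fit in one frame")
    return frame


def groups_are_adjacent(frame: List[Slot]) -> bool:
    """True if no group identifier appears in two separate runs."""
    runs = [gid for gid, _ in groupby(slot[0] for slot in frame)]
    return len(runs) == len(set(runs))


frame = pack_frame([["load", "add"], ["fmul"], ["store", "sub", "br"]], width=8)
print(frame)
print(groups_are_adjacent(frame))
```

With this layout the group to issue next is always found at one end of the not-yet-issued instructions, which is the assumption the check relies on.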

[0018] In another embodiment of our invention a method of
executing arbitrary numbers of instructions in a stream of
instructions in parallel includes the steps of compiling the
instructions to determine which instructions can be executed
simultaneously, assigning group identifiers to sets of
instructions that can be executed in parallel, determining a
pipeline for execution of each instruction, assigning a
pipeline identifier to each instruction, and placing the
instructions in a cache line or register for execution by the
pipelines.
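
The compile-time steps listed above can be sketched as a simple greedy tagging pass. Everything specific here is an assumption for illustration: the `Op` record, the `PIPELINES` table mapping unit kinds to pipeline numbers, the explicit `deps` lists standing in for the compiler's dependence analysis, and the greedy policy itself. The description does not prescribe a particular scheduling algorithm.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Op(NamedTuple):
    name: str
    unit: str              # kind of functional unit needed, e.g. "int"
    deps: Tuple[int, ...]  # indices of earlier ops that must issue first


# Hypothetical machine: which pipeline numbers serve which unit kind.
PIPELINES: Dict[str, List[int]] = {"mem": [0, 1], "int": [2, 3], "fp": [4, 5]}


def tag(ops: List[Op]) -> List[Tuple[int, int]]:
    """Return a (group_id, pipeline_id) pair for every op, in program order."""
    tags: List[Tuple[int, int]] = []
    busy: Dict[int, Set[int]] = {}  # group_id -> pipelines already claimed
    for op in ops:
        # An op must issue in a later group than every op it depends on.
        group = max((tags[d][0] + 1 for d in op.deps), default=0)
        while True:
            taken = busy.setdefault(group, set())
            free = [p for p in PIPELINES[op.unit] if p not in taken]
            if free:
                break
            group += 1  # all suitable pipelines taken; try the next group
        busy[group].add(free[0])
        tags.append((group, free[0]))
    return tags


program = [
    Op("load r1", "mem", ()),
    Op("load r2", "mem", ()),
    Op("load r3", "mem", ()),
    Op("add r4, r1, r2", "int", (0, 1)),
    Op("fmul f2, f0, f1", "fp", ()),
    Op("store r4", "mem", (3,)),
]
for op, (group, pipe) in zip(program, tag(program)):
    print(f"group {group} pipeline {pipe}: {op.name}")
```

The pass only honours the orderings listed in `deps`, so any other constraint the compiler must respect would have to be expressed there as well. Placing the tagged instructions into a cache line or register is a separate step not shown.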
[0019] Further embodiments of the invention are:

1. A computing system for executing groups of individual
instructions in parallel by processing pipelines, the system
comprising:

storage means for holding at least one group of instructions
to be executed in parallel, each instruction in the group
having associated therewith a pipeline identifier indicative
of the pipeline for executing that instruction and a group
identifier indicative of the group of instructions to be
executed in parallel;

means responsive to the group identifier for causing all
instructions having the same group identifier to be executed
at the same time; and

means responsive to the pipeline identifier of the individual
instructions in the group to supply each instruction in the
group to be executed in parallel to an appropriate pipeline.

2. A computing system as in embodiment 1 wherein the storage
means includes the at least one group of instructions, and for
each instruction the storage means includes the group
identifier and the pipeline identifier.

3. A computing system as in embodiment 2 wherein each
instruction in the at least one group of instructions has
associated therewith a different pipeline identifier.

4. A computing system as in embodiment 1 wherein the storage
means holds at least two groups of instructions, and all of
the instructions in each group having associated therewith a
common group identifier are placed adjacent to each other in
the storage means.

5. A computing system as in embodiment 4 wherein:

the storage means comprises a line in a cache memory having a
fixed number of storage locations;

the group of instructions to be executed first is placed at
one end of the line in the cache memory, and the instructions
in the group to be executed last are placed at the other end
of the line in the cache memory.

6. A method of executing arbitrary numbers of instructions in
a stream of instructions in parallel which have been compiled
to determine which instructions can be executed at the same
time, the method comprising:

in response to the compilation assigning group identifiers to
sets of instructions which can be executed in parallel;

determining a pipeline for execution of each instruction in a
group to be executed;

assigning a pipeline identifier to each instruction in the
group; and

placing the instructions in a register for execution by the
pipelines.

7. A method as in embodiment 6 further comprising 
the step of executing a group of instructions in par- 
allel. 

8. A method as in embodiment 7 wherein the register holds at
least two groups of instructions, and the step of placing the
instructions in the register for execution by the pipelines
comprises placing the instructions in each group having
associated therewith a common group identifier adjacent to
each other in the register.

9. A method as in embodiment 8 wherein the step of executing a
group of instructions in parallel comprises coupling the
register to detection means to receive the group identifier of
each instruction in the register and the group identifier of
the next group of instructions to be supplied to the
pipelines; and

supplying only the instructions with the next group identifier
to the pipeline execution units.

10. In a computing system in which groups of individual
instructions are executable in parallel by processing
pipelines, a method for supplying each instruction in a group
to be executed in parallel to an appropriate pipeline, the
method comprising:

storing in storage an instruction frame, the frame including
at least one group of instructions to be executed in parallel,
each instruction in the group having associated therewith a
pipeline identifier indicative of the pipeline which will
execute that instruction and a group identifier indicative of
the group identification;

comparing the group identifier of each instruction in the
instruction frame and a group identifier of those instructions
to be next executed in parallel; and

using the pipeline identifier of those instructions to be next
executed in parallel to control an execution unit to execute
all of the instructions in the group in separate pipelines.

11. In a computing system in which groups of individual
instructions are executable in parallel by processing
pipelines, apparatus for routing each instruction in a group
to be executed in parallel to an appropriate pipeline, the
apparatus comprising:

storage for holding at least one group of instruc- 
tions to be executed in parallel, each instruction 
in the group having associated therewith a 
pipeline identifier indicative of the pipeline for 
executing that instruction and a group identifier 
to designate among the instructions present in 
the storage those instructions which may be si- 
multaneously supplied to the processing pipe- 
lines; 

a crossbar having a first set of connectors cou- 
pled to the storage for receiving instructions 
therefrom and a second set of connectors cou- 
pled to the processing pipelines; 
means responsive to the pipeline identifier of 
the individual instructions in the group for rout- 
ing individual instructions onto appropriate 
ones of the second set of connectors, to there- 
by supply each instruction in the group to be 
executed in parallel to the appropriate pipeline. 

12. Apparatus as in embodiment 11 wherein: 

the first set of connectors consists of a set of 
first communication buses, one for each in- 
struction in the storage; 
the second set of connectors consists of a set 
of second communication buses, one for each 
pipeline; and the means responsive to the pipe- 
line identifier comprises: 

a set of decoders coupled to the storage to 
receive as first input signals the pipeline 
identifiers and in response thereto supply 
as output signals a switch control signal; 
and 

a set of switches, coupled to the decoders, 
one switch at the intersection of each of the 
first set of connectors with the second set 
of connectors, the switches providing con- 
nections in response to receiving the 
switch control signal to thereby supply 
each instruction in the group to be execut- 
ed in parallel to the appropriate pipeline. 

13. Apparatus as in embodiment 12 further comprising:

detection means coupled to receive the group identifier of
each instruction in the storage and connected to receive
information regarding the group identifier of the next group
of instructions to be supplied to the pipelines, and in
response thereto supply a group control signal; and

wherein the set of decoders coupled to the storage are also
coupled to the detection means to receive the group control
signal and in response thereto supply a switch control signal
for only those instructions in the group to be supplied to the
pipelines.

14. Apparatus as in embodiment 13 wherein the detection means
comprises a multiplexer coupled to receive each of the group
identifiers of instructions in the storage and compare them to
the information regarding the group identifier of the next
group of instructions to be supplied to the pipelines.

15. Apparatus as in embodiment 14 wherein the multiplexer
supplies an output signal to the decoders to indicate the
group identifier of the next group of instructions to be
supplied to the pipelines.

16. In a computing system in which groups of individual
instructions are executable in parallel by processing
pipelines, apparatus for routing each instruction in a group
to be executed in parallel to an appropriate pipeline, the
apparatus comprising:

a storage for holding an instruction frame, the frame
including at least one group of instructions to be executed in
parallel, each instruction in the group having associated
therewith a pipeline identifier indicative of the pipeline to
which that instruction is to be issued and a group identifier
indicative of the group identification;

a crossbar switch having a first set of connectors coupled to
the storage for receiving instructions therefrom and a second
set of connectors coupled to the processing pipelines;

selection means connected to receive the group identification
of each instruction in the instruction frame and connected to
receive information about the group identifier of those
instructions to be next executed in parallel for supplying in
response thereto an output signal indicative of the next set
of instructions to be executed in parallel; and

decoder means coupled to receive the output signal and each of
the pipeline identifiers of the instructions in the storage
for selectively connecting ones of the first set of connectors
to ones of the second set of connectors to thereby supply each
instruction in the group to be executed in parallel to the
appropriate pipeline.

17. Apparatus as in embodiment 16 wherein the first set of
connectors consists of a set of first communication buses, one
for each instruction in the storage;

the second set of connectors consists of a set of second
communication buses, one for each pipeline;

the decoder means comprises a set of decoders coupled to
receive as first input signals the pipeline identifiers and as
second input signals information about the group identifier of
the next group of instructions to be executed by the pipelines
and in response thereto supply as output signals a switch
control signal; and

the crossbar switch includes a set of switches, one at the
intersection of each of the first set of connectors with the
second set of connectors, the switches providing connections
in response to receiving the switch control signal to thereby
supply each instruction in the group to be executed in
parallel to the appropriate pipeline.

18. Apparatus as in embodiment 17 wherein the
selection means coupled to the storage comprises a
multiplexer coupled to receive each of the group 
identifiers of instructions in the storage and com- 
pare them to information regarding the group iden- 
tifier of the next group of instructions to be supplied 
to the pipelines. 

19. Apparatus as in embodiment 18 wherein the 
multiplexer supplies an output signal to the decod- 
ers to select the group identifier of the next group 
of instructions to be supplied to the pipelines. 

20. In a computing system in which groups of indi- 
vidual instructions are executable in parallel by 
processing pipelines, a method for transferring 
each instruction in a group to be executed through 
a crossbar switch having a first set of connectors 
coupled to the storage for receiving instructions 
therefrom and a second set of connectors coupled 
to the processing pipelines, the method comprising: 
storing in storage at least one group of instructions 
to be executed in parallel, each instruction in the 
group having associated therewith a pipeline iden- 
tifier indicative of the pipeline which will execute that 
instruction; and 

using the pipeline identifiers of the individual 
instructions in the at least one group of instructions 
which are to be executed next to control switches 
between the first set of connectors and the second 
set of connectors to thereby supply each instruction 
in the group to be executed in parallel to the appro- 
priate pipeline. 

21. A method as in embodiment 20 wherein the step of using
comprises:

supplying the pipeline identifiers of the individual
instructions in the at least one group of instructions to a
corresponding number of decoders, each of which provides an
output signal indicative of the pipeline identifiers; and

using the decoder output signals to control the switches
between the first set of connectors and the second set of
connectors to thereby supply each instruction in the group to
be executed in parallel to the appropriate pipeline.

22. A method as in embodiment 21 wherein each of the
instructions in the storage further includes a group
identifier to designate among the instructions present in the
storage which may be simultaneously supplied to the processing
pipelines, and the method further comprises:

supplying information about the group identifier of the next
group of instructions to be executed by the pipelines together
with the group identifiers of the individual instructions in
the at least one group of instructions to a selector;

comparing the group identifier of the next group of
instructions to be executed by the pipelines with the group
identifiers of the individual instructions in the at least one
group of instructions, to provide output comparison signals;
and

using both the output comparison signals and the decoder
output signals to control the switches between the first set
of connectors and the second set of connectors to thereby
supply each instruction in the group to be executed in
parallel to the appropriate pipeline.

23. In a computing system in which groups of individual
instructions are executable in parallel by processing
pipelines, a method for supplying each instruction in a group
to be executed in parallel to an appropriate pipeline, the
method comprising:

storing in storage an instruction frame, the frame including
at least one group of instructions to be executed in parallel,
each instruction in the group having associated therewith a
pipeline identifier indicative of the pipeline which will
execute that instruction and a group identifier indicative of
the group identification;

comparing the group identifier of each instruction in the
instruction frame and a group identifier of those instructions
to be next executed in parallel; and

using the pipeline identifier of those instructions to be next
executed in parallel to control switches in a crossbar switch
having a first set of connectors coupled to the storage for
receiving instructions therefrom and a second set of
connectors coupled to the processing pipelines to thereby
supply each instruction in the group to be executed in
parallel to the appropriate pipeline.

24. A computing system for executing groups of individual
instructions in parallel by processing pipelines, the system
comprising:

storage means for holding at least one group of instructions
to be executed in parallel, each instruction in the group
having associated therewith a pipeline identifier indicative
of the pipeline for executing that instruction and a group
identifier indicative of a group of instructions to be
executed in parallel;
means responsive to the group identifier for 
causing all instructions having the same group 
identifier to be executed at the same time; and 
means responsive to the pipeline identifier of 
the individual instructions in the group to supply 
each instruction in the group to be executed in 
parallel to an appropriate pipeline, 

wherein the storage means includes the at 
least one group of instructions, and for each instruc- 
tion the storage means includes the group identifier 
and the pipeline identifier. 

25. The system of embodiment 24, wherein each 
instruction in the at least one group of instructions 
has associated therewith a different pipeline identi- 
fier. 

26. The system of embodiment 24 or 25, wherein 
the storage means holds at least two groups of in- 
structions, and all of the instructions in each group 
having associated therewith a common group iden- 
tifier are placed adjacent to each other in the stor- 
age means. 

27. The system of embodiment 26, wherein the stor- 
age means comprises a line in a cache memory 
having a fixed number of storage locations; the 
group of instructions to be executed first is placed 
at one end of the line in the cache memory, and the 
instructions in the group to be executed last are
placed at the other end of the line in the cache
memory.

28. The system of any one of the embodiments 24 
to 27, wherein a crossbar having a first set of con- 
nectors coupled to the storage means for receiving 
instructions therefrom and a second set of connec- 
tors coupled to the processing pipelines is provided. 

29. The system of embodiment 28, wherein the first 
and the second set of connectors, respectively, con- 
sist of a set of communication buses, one for each 
instruction in the storage means and one for each 
pipeline, respectively, and wherein a set of decod- 
ers coupled to the storage means to receive as first 
input signals the pipeline identifiers and in response 
thereto supply as output signals a switch control sig- 
nal and a set of switches, coupled to the decoders 
with one switch at the intersection of each of the first 
set of connectors, are provided, the switches pro- 
viding connections in response to receiving the 
switch control signal to thereby supply each instruc- 
tion in the group to be executed in parallel to the 
appropriate pipeline. 

30. The system of embodiment 29, further comprising detection
means coupled to receive the group identifier of each
instruction in the storage means and connected to receive
information regarding the group identifier of the next group
of instructions to be supplied to the pipelines, and in
response thereto supply a group control signal, and wherein
the set of decoders coupled to the storage means are also
coupled to the detection means to receive the group control
signal and in response thereto supply a switch control signal
for only those instructions in the group to be supplied to the
pipelines.

31. The system of embodiment 30, wherein the detection means
comprises a multiplexer coupled to receive each of the group
identifiers of instructions in the storage means and compare
them to the information regarding the group identifier of the
next group of instructions to be supplied to the pipelines.

32. The system of embodiment 31, wherein the multiplexer
supplies an output signal to the decoders to indicate the
group identifier of the next group of instructions to be
supplied to the pipelines.

33. The system of any one of the embodiments 24 
to 32, wherein the group of instructions is an instruc- 
tion frame. 

34. The system of embodiment 33, further comprising selection
means connected to receive the group identifier of each
instruction in the instruction frame and connected to receive
information about the group identifier of those instructions
to be next executed in parallel for supplying in response
thereto an output signal indicative of the next set of
instructions to be executed in parallel, and decoder means
coupled to receive the output signal and each of the pipeline
identifiers of the instructions in the storage for selectively
connecting ones of the first set of connectors to thereby
supply each instruction in the group to be executed in
parallel at the appropriate pipeline.

35. A method for executing groups of individual instructions
in parallel by processing pipelines, the method comprising:

storing in storage means for holding at least one group of
instructions to be executed in parallel, each instruction in
the group having associated therewith a pipeline identifier
indicative of the pipeline for executing that instruction and
a group identifier indicative of a group of instructions to be
executed in parallel;

comparing the group identifier of each instruction in the
group of instructions and a group identifier of those
instructions to be next executed in parallel; and

using the pipeline identifier of those instructions to be next
executed in parallel to control an execution unit to execute
all of the instructions in the group in separate pipelines.

36. The method of embodiment 35, further comprising
transferring each instruction in a group to be
ecuted through a crossbar having a first set of con- 
nectors coupled to the storage means for receiving 
instructions therefrom and a second set of connec- 
tors coupled to the processing pipelines and using 
the pipeline identifiers of the individual instructions 
in the at least one group of instructions which are 
to be executed next to control switches between the 
first set of connectors and the second set of con- 
nectors to thereby supply each instruction in the 
group to be executed in parallel to the appropriate 
pipeline. 

37. The method of embodiment 36, wherein the step 
of using comprises supplying the pipeline identifiers 
of the individual instructions in the at least one 
group of instructions to a corresponding number of 
decoders, each of which provides an output signal 
indicative of the pipeline identifiers; and using the 
decoder output signals to control the switches be- 
tween the first set of connectors and the second set 
of connectors. 

38. A computing system in which groups of
instructions are issued in parallel to processing pipelines
(0, ..., 7), the computing system comprising:

storage means (50; 70, 74) for holding an
instruction frame, the instruction frame including
a plurality of instructions including at least one
group of instructions, the instructions in the at
least one group of instructions to be issued in
parallel, the at least one group being
determined at compile time,

wherein the instruction frame includes data
associated with the plurality of instructions in the
instruction frame, the data indicative of which
instructions are included in the one group of instructions
and further indicative of processing pipelines
(0, ..., 7) appropriate for the plurality of instructions
in the instruction frame; and

means (60; 100) responsive to the data
associated with the plurality of instructions for issuing the
instructions in the one group of instructions in
parallel to processing pipelines (0, ..., 7) appropriate for
the instructions in the one group of instructions.

39. The computing system of embodiment 38, 
wherein at least one group of instructions compris- 
es at least two or at least three instructions. 

40. The computing system of embodiment 38 or 39,
wherein the plurality of instructions also includes at
least another instruction belonging to another group
of instructions, the another instruction to be issued
after the at least one group of instructions has been
issued.

41. The computing system of one of the
embodiments 38 to 40, wherein the processing pipelines
(0, ..., 7) appropriate for the instructions in the one
group of instructions are respectively coupled to
execution units appropriate for the instructions in the
one group of instructions.

42. The computing system of one of the
embodiments 38 to 41, wherein an execution unit
appropriate for one instruction in the one group of
instructions is a memory or an arithmetic logic unit or a
floating point unit or a branch unit.

43. The computing system of one of the
embodiments 38 to 42, wherein an execution unit
appropriate for a first instruction in the one group of
instructions and an execution unit appropriate for a second
instruction in the one group of instructions are
similar.

44. The computing system of one of the
embodiments 38 to 43, wherein the plurality of instructions
in the instruction frame are determined at compile
time.

45. A method for issuing groups of instructions in
parallel to processing pipelines (0, ..., 7), the
method comprising:

storing in a storage means (50; 70, 74) an
instruction frame, the instruction frame including
a plurality of instructions including at least one
group of instructions, the instructions in the at
least one group of instructions to be issued in
parallel, the at least one group being
determined at compile time,

wherein the instruction frame includes data
associated with the plurality of instructions in the
instruction frame, the data indicative of which
instructions are included in the one group of instructions
and further indicative of processing pipelines
(0, ..., 7) appropriate for the plurality of instructions
in the instruction frame; and

the one group of instructions is issued in
parallel to processing pipelines (0, ..., 7) appropriate for
the instructions in the one group of instructions in
response to the data associated with the plurality of
instructions in the instruction frame.

46. The method of embodiment 45, further comprising
a compilation step, during which the plurality of
instructions in the instruction frame and the data
associated with the plurality of instructions in the
instruction frame are determined.

47. The method of embodiment 45 or 46, wherein
another instruction of the frame is issued to a
processing pipeline (0, ..., 7) appropriate for the
another instruction in response to the data associated
with the plurality of instructions in the instruction
frame following the step of issuing the one group of
instructions.

48. The method of one of the embodiments 45 to
47, wherein an instruction in the group is issued to
an arithmetic logic unit or a floating point unit or a
memory unit.

49. The method of embodiment 48, wherein another
instruction in the group is issued to an arithmetic
logic unit or a floating point unit or a memory unit.

50. A processor for operating a computing system
according to the method of one of the embodiments
45 to 49, wherein

at least two processing pipelines (0, ..., 7);

storage means (50; 70, 74) for holding at least
one group of instructions to be issued in parallel,
and for holding data associated with the one group
of instructions, the data indicative of processing
pipelines (0, ..., 7) appropriate for the instructions in
the at least one group of instructions; and

means (60; 100) responsive to the data for
issuing the instructions in the one group of
instructions in parallel to processing pipelines (0, ..., 7)
appropriate for the instructions in the one group of
instructions, are provided.

51. A processor comprising:

a register file having a plurality of registers;
an instruction set including instructions
(INSTRUCTION) which address the registers,
each instruction (INSTRUCTION) being one of
a plurality of instruction types;
a plurality of execution units (0...7), each
execution unit (0...7) being one of a plurality of
types, wherein each instruction type is
executed on one or more execution unit types;
and further wherein the instructions
(INSTRUCTION) are encoded in frames, each
frame including a plurality of instructions
(INSTRUCTION) and template bits (S, P) grouped
together in an N-bit field, the instructions
(INSTRUCTION) being located in instruction
slots of the N-bit field, the template bits (S, P)
specifying a mapping of the instruction slots to
the execution unit types.

52. The processor of embodiment 51 wherein the 
template bits further specify instruction group 
boundaries within the frame, with an instruction 
group comprising a set of statically contiguous in- 
structions that are executed concurrently. 

53. The processor of embodiment 51 or 52 wherein 
the instruction types include integer arithmetic logic 
unit, memory, floating-point, and branch instruc- 
tions. 

54. The processor of one of the embodiments 51 to
53 wherein the execution unit types include integer,
memory, floating-point, and branch execution units.

55. The processor of one of the embodiments 51 to
54 wherein the frame further includes a stop-bit that
specifies an inter-frame instruction group boundary.

56. The processor according to one of the embod- 
iments 51 to 55, further comprising a memory that 
stores the frames, a byte order of the frames in the 
memory being in a little-endian format or in a big- 
endian format. 
57. The processor according to one of the
embodiments 51 to 56 wherein the frames are ordered in
the memory from a lowest to a highest memory
address.

58. The processor according to one of the
embodiments 51 to 57 wherein an instruction in the frame
with the lowest memory address precedes an
instruction in the frame with the highest memory
address.

59. The processor according to one of the
embodiments 51 to 58 wherein the template bits are partly
determined by hardware.

60. The processor according to one of the embod- 
iments 51 to 59 wherein the template bits are at 
least partly determined at a compile time. 

61. A method for operating a processor comprising:

storing a frame of instructions (INSTRUCTION)
in a memory, the frame including a plurality of
instructions and template bits (S, P), the
plurality of instructions (INSTRUCTION) and the
template bits (S, P) grouped together in an
N-bit field, the instructions (INSTRUCTION) being
located in instruction slots of the N-bit field, the
template bits (S, P) specifying a mapping of the
instruction slots to execution unit types, each
instruction (INSTRUCTION) being one of a
plurality of instruction types from an instruction set
which address a plurality of registers in a
register file; and
executing each instruction type on an execution
unit (0...7) from a plurality of execution units
(0...7) being one of a plurality of execution unit
types.

62. The method according to embodiment 61
comprising the step of specifying instruction group
boundaries within the frame at the template bits,
with an instruction group comprising a set of
statically contiguous instructions that are executed
concurrently.

63. The method according to embodiment 61 or 62
comprising the step of including a stop-bit into the
frame, said stop-bit specifying an inter-frame
instruction group boundary.

64. The method according to embodiment 63 com- 
prising the step of specifying the instruction group 
boundary to occur after a last instruction of a current 
frame if the stop-bit is in a first condition. 

65. The method according to embodiment 64
comprising the step of specifying the instruction group
to include the last instruction of the current frame if the
stop-bit is in a second condition.

66. The method according to one of the
embodiments 61 to 65 comprising the step of storing the
frames in a memory with a byte order in a
little-endian or a big-endian format.

67. The method according to one of the
embodiments 61 to 66 comprising the step of ordering the
frames in the memory from a lowest to a highest
memory address.

68. The method according to one of the embodiments 61
to 67 wherein the instruction types include integer
arithmetic logic unit, memory, floating point, and
branch units.

69. The method according to one of the embodi- 
ments 61 to 68 comprising the step of determining 
the template bits partly by hardware. 

70. The method according to one of the embodi- 
ments 61 to 69 comprising the step of determining 
the template bits at least partly at compile time. 

71. A memory comprising:

a frame of instructions (INSTRUCTION), the
frame including a plurality of instructions
(INSTRUCTION) and template bits (S, P), the
plurality of instructions (INSTRUCTION) and the
template bits (S, P) grouped together in an
N-bit field, the instructions (INSTRUCTION) being
located in instruction slots of the N-bit field, the
template bits (S, P) specifying a mapping of the
instruction slots to execution unit types, each
instruction (INSTRUCTION) being one of a
plurality of instruction types from an instruction set
and which address a plurality of registers in a
register file,

wherein each instruction type is to be
executed on an execution unit (0...7) from a plurality of
execution units (0...7) being one of a plurality of
execution unit types, and

wherein instructions in the frame may be
issued to respective execution units (0...7) at different
times.

72. The memory of embodiment 71 wherein the 
template bits further specify instruction group 
boundaries within the frame, with an instruction 
group comprising a set of statically contiguous in- 
structions that are executed concurrently. 

73. The memory of embodiment 71 or 72 wherein 
the frame further includes a stop-bit that specifies 
an inter-frame instruction group boundary. 

74. The memory of embodiment 73 wherein, if the 
stop bit is in a first condition, the instruction group 
includes the last instruction of the current frame. 

75. The memory according to one of the embodi- 
ments 71 to 74 wherein a byte order of the frames 
stored therein is in a little-endian format or in a big- 
endian format. 

76. The memory according to one of the embodi- 
ments 71 to 75 wherein the frames are ordered from 
a lowest to a highest memory address. 

77. The memory according to one of the embodi- 
ments 71 to 76 wherein an instruction (INSTRUC- 
TION) in the frame with the lowest memory address 
precedes an instruction in the frame with the highest 
memory address.

78. The memory according to one of the embodi- 
ments 71 to 77, further comprising an unused en- 
coding of the template bits being available for use 
in a future extension. 

79. The memory according to one of the embodi- 
ments 71 to 78 wherein the template bits are deriv- 
able from machine code. 

80. The memory according to one of the embodi- 
ments 71 to 79 wherein the template bits are deter- 
minable at least partly at a compile time. 

Figure 1 is a block diagram illustrating a preferred 
implementation of this invention; 
Figure 2 is a diagram illustrating the data structure 
of an instruction word in this system; 
Figure 3 is a diagram illustrating a group of instruc- 
tion words; 

Figure 4 is a diagram illustrating a frame containing 
from one to eight groups of instructions; 
Figure 5a illustrates the frame structure for one 
maximum-sized group of eight instructions; 
Figure 5b illustrates the frame structure for a typical 
mix of three intermediate sized groups of instruc- 
tions; 

Figure 5c illustrates the frame structure for eight 
minimum-sized groups, each of one instruction; 
Figure 6 illustrates an instruction word after prede- 
coding; 

Figure 7 illustrates the operation of the predecoder; 
Figure 8 is a diagram illustrating the overall struc- 
ture of the instruction cache; 
Figure 9 is a diagram illustrating the manner in which
frames are selected from the instruction cache; 
Figure 10 is a diagram illustrating the group selec- 
tion function in the associative crossbar; 
Figure 11 is a diagram illustrating the group dis- 
patch function in the associative crossbar; 
Figure 12 is a diagram illustrating a hypothetical
frame of instructions;
Figure 13 is a diagram illustrating the manner in
which the groups of instructions in Figure 12 are
issued on different clock cycles;
Figure 14 is a diagram illustrating another
embodiment of the associative crossbar; and
Figure 15 is a diagram illustrating the group select
function in further detail.

[0020] Figure 1 is a block diagram of a computer sys- 
tem according to the preferred embodiment of this in- 
vention. Figure 1 illustrates the organization of the inte- 
grated circuit chips by which the computing system is 
formed. As depicted, the system includes a first integrat- 
ed circuit 10 that includes a central processing unit, a 
floating point unit, and an instruction cache. 
[0021] In the preferred embodiment the instruction
cache is a 16 kilobyte two-way set-associative 32 byte
line cache. A set associative cache is one in which the
lines (or blocks) can be placed only in a restricted set of
locations. The line is first mapped into a set, but can be
placed anywhere within that set. In a two-way set
associative cache, two sets, or compartments, are provided,
and each line can be placed in one compartment or the
other.
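
By way of illustration only, the geometry implied by these parameters can be checked with a short calculation. The following Python sketch is not part of the described system; the function name `cache_geometry` is hypothetical, and the sketch assumes that every size is a power of two.

```python
def cache_geometry(capacity_bytes, ways, line_bytes):
    """Return (lines_per_compartment, index_bits, offset_bits).

    Assumption of this sketch: every argument is a power of two.
    """
    lines_per_compartment = capacity_bytes // (ways * line_bytes)
    index_bits = lines_per_compartment.bit_length() - 1
    offset_bits = line_bytes.bit_length() - 1
    return lines_per_compartment, index_bits, offset_bits


# 16 kilobyte, two-way set-associative, 32 byte line instruction cache
print(cache_geometry(16 * 1024, 2, 32))  # (256, 8, 5)
```

Under these assumptions the instruction cache has 256 lines in each of its two compartments.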

[0022] The system also includes a data cache chip 20 
that comprises a 32 kilobyte four-way set-associative 32 
byte line cache. The third chip 30 of the system includes 
a predecoder, a cache controller, and a memory control- 
ler. The predecoder and instruction cache are explained 
further below. For the purposes of this invention, the 
CPU, FPU, data cache, cache controller and memory 
controller all may be considered of conventional design. 
[0023] The communication paths among the chips are
illustrated by arrows in Figure 1. As shown, the CPU/FPU
and instruction cache chip communicates over a
32 bit wide bus 12 with the predecoder chip 30. The
asterisk is used to indicate that these communications are
multiplexed so that a 64 bit word is communicated in two
cycles. Chip 10 also receives information over 64 bit
wide buses 14, 16 from the data cache 20, and supplies
information to the data cache 20 over three 32 bit wide
buses 18.

[0024] The specific functions of the predecoder are 
described in much greater detail below; however, es- 
sentially it functions to decode a 32 bit instruction re- 
ceived from the secondary cache into a 64 bit word, and 
to supply that 64 bit word to the instruction cache on chip 
10. 

[0025] The cache controller on chip 30 is activated 
whenever a first level cache miss occurs. Then the 
cache controller either goes to main memory or to the 
secondary cache to fetch the needed information. In the 
preferred embodiment the secondary cache lines are 32 
bytes and the cache has an 8 kilobyte page size. 
[0026] The data cache chip 20 communicates with the 
cache controller chip 30 over another 32 bit wide bus. 
In addition, the cache controller chip 30 communicates 
over a 64 bit wide bus 32 with the DRAM memory, over 
a 128 bit wide bus 34 with a secondary cache, and over 
a 64 bit wide bus 36 to input/output devices. 
[0027] As will be described, the system shown in Fig. 
1 includes both conventional and novel features. The 
system includes multiple pipelines able to operate in 
parallel on separate instructions. The instructions that 
can be dispatched to these parallel pipelines simultane- 
ously, in what we term "instruction groups", have been 
identified by the compiler and tagged with a group iden- 
tification tag. Thus, the group tag designates instruc- 
tions that can be executed simultaneously. Instructions 
within the group are also tagged with a pipeline tag in- 
dicative of the specific pipeline to which that instruction 
should be dispatched. This operation is also performed 
by the compiler. 

[0028] In this system, each group of instructions can
contain an arbitrary number of instructions ordered in
an arbitrary sequence. The only limitation is that all
instructions in the group must be capable of simultaneous
execution; e.g., there cannot be a data dependency
between instructions. The instruction groups are collected
into larger sets and are organized into fixed width
"frames" and stored. Each frame can contain a variable
number of tightly packed instruction groups, depending
upon the number of instructions in each group and on
the width of the frame.

[0029] Below we describe this concept more fully, and
also describe a mechanism to route in parallel each
instruction in an arbitrarily selected group to its
appropriate pipeline, as determined by the pipeline tag of the
instruction.

[0030] In the following description of the word, group,
and frame concepts mentioned above, specific bit and
byte widths are used for the word, group and frame. It
should be appreciated that these widths are arbitrary, and
can be varied as desired. None of the general
mechanisms described for achieving the result of this invention
depends upon the specific implementation.
[0031] In one embodiment of this system the central 
processing unit includes eight functional units and is ca- 
pable of executing eight instructions in parallel. We des- 
ignate these pipelines using the digits 0 to 7. Also, for 
this explanation each instruction word is 32 bits (4 bytes) 
long, with a bit, for example, the high order bit S being 
reserved as a flag for group identification. Figure 2 
therefore shows the general format of all instructions. 
As shown by Figure 2, bits 0 to 30 represent the
instruction, with the high order bit 31 reserved to flag groups
of instructions, i.e., collections of instructions the
compiler has determined may be executed in parallel.
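
As a minimal sketch of the word format of Figure 2, and not a description of any actual circuit, the following Python fragment separates a 32 bit word into its 31 instruction bits and its S bit. The names `split_instruction_word`, `S_BIT_POSITION` and `INSTRUCTION_MASK` are hypothetical.

```python
S_BIT_POSITION = 31               # high order bit, per Figure 2
INSTRUCTION_MASK = (1 << 31) - 1  # bits 0 to 30 hold the instruction


def split_instruction_word(word):
    """Split a 32 bit word into (instruction_bits, s_bit)."""
    if not 0 <= word < (1 << 32):
        raise ValueError("expected a 32 bit word")
    return word & INSTRUCTION_MASK, (word >> S_BIT_POSITION) & 1
```
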
[0032] Figure 3 illustrates a group of instructions. A
group of instructions consists of one to eight instructions
(because there are eight pipelines in the preferred
implementation) ordered in any arbitrary sequence, each
of which can be dispatched to a different parallel pipeline
simultaneously.

[0033] Figure 4 illustrates the structure of an instruc- 
tion frame. In the preferred embodiment an instruction 
frame is 32 bytes wide and can contain up to eight in- 
struction groups, each comprising from one to eight in- 
structions. This is explained further below. 
[0034] When the instruction stream is compiled before
execution, the compiler places instructions in the same
group next to each other in any order within the group
and then places that group in the frame. The instruction
groups are ordered within the frame from left to right
according to their issue sequence. That is, of the groups
of instructions in the frame, the first group to issue is
placed in the leftmost position, the second group to
issue is placed in the next position to the right, etc. Thus,
the last group of instructions to issue within that frame
will be placed in the rightmost location in the frame. As
explained, the group affiliation of all instructions in the
same group is indicated by setting the S bit (bit 31 in
Figure 2) to the same value. This value toggles back
and forth from 0 to 1 to 0, etc. between adjacent groups
to thereby identify the groups. Thus, all instructions in
the first group in a frame have the S bit set to 0, all
instructions in the second group have the S bit set to 1,
all instructions in the third group have the S bit set to 0,
etc. for all groups of instructions in the frame.
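
The toggling rule just described can be sketched as follows. This Python fragment is illustrative only; the function name `assign_s_bits` is hypothetical, and the limit of eight words per frame reflects the 32 byte frame of the preferred embodiment.

```python
FRAME_WORDS = 8  # 32 byte frame of 4 byte instruction words


def assign_s_bits(group_sizes):
    """Return one S bit per instruction word in a frame.

    group_sizes lists the number of instructions in each group, in issue
    order (left to right). The first group receives S = 0 and the value
    toggles between adjacent groups.
    """
    if not group_sizes or min(group_sizes) < 1:
        raise ValueError("each group needs at least one instruction")
    if sum(group_sizes) > FRAME_WORDS:
        raise ValueError("groups do not fit in one frame")
    s_bits = []
    for position, size in enumerate(group_sizes):
        s_bits.extend([position % 2] * size)
    return s_bits


print(assign_s_bits([2, 3, 3]))  # [0, 0, 1, 1, 1, 0, 0, 0]
```
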

[0035] To clarify the use of a frame, Figure 5 illustrates
three different frame structures for different hypothetical
groups of instructions. In Figure 5a the frame structure
for a group of eight instructions, all of which can be
issued simultaneously, is shown. The instruction words
are designated W0, W1, ..., W7. The S bit for each one
of the instruction words has been set to 0 by the
compiler, thereby indicating that all eight instructions can be
issued simultaneously.

[0036] Figure 5b illustrates the frame structure for a
typical mixture of three intermediate sized groups of
instructions. In Figure 5b three groups of instructions are
designated Group 0, Group 1 and Group 2. Shown at
the left-hand side of Figure 5b is Group 0, which consists
of two instruction words W0 and W1. The S bit for each
of these instructions has been set to 0. Group 1 of
instructions consists of three instruction words, W2, W3
and W4, each having the S bit set to 1. Finally, Group 2
consists of three instruction words, W5, W6 and W7,
each having its S bit set to 0.

[0037] Figure 5c illustrates the frame structure for
eight minimum sized groups, each consisting of a single
instruction. Because each "group" of a single instruction
must be issued before the next group, the S bits toggle
in a sequence 01010101 as shown.

[0038] As briefly mentioned above, in the preferred
embodiment the group identifiers are associated with
individual instructions in a group during compilation. In the
preferred embodiment, this is achieved by compiling the
instructions to be executed using well-known compiler
technology. During the compilation, the instructions are
checked for data dependencies, dependence upon
previous branch instructions, or other conditions that
preclude their execution in parallel with other instructions.
These steps are performed using a well-known
compiler. The result of the compilation is a group identifier
being associated with each instruction. It is not necessary
that the group identifier be added to the instruction as a
tag, as shown in the preferred embodiment and
described further below. In an alternative approach, the
group identifier is provided as a separate tag that is later
associated with the instruction. This makes possible the
execution of programs on our system without the need to
revise the word width.

[0039] In addition, in some embodiments the compiler
will determine the appropriate pipeline for execution of
an individual instruction. This determination is
essentially a determination of the type of instruction provided. For
example, load instructions will be sent to the load
pipeline, store instructions to the store pipeline, etc. The
association of the instruction with the given pipeline can be
achieved either by the compiler, or by later examination
of the instruction itself, for example during predecoding.
[0040] Referring again to Figure 1, in normal
operation the CPU will execute instructions from the instruc-
tion cache, according to well-known principles. On an 
instruction cache miss, however, the entire frame con- 
taining the instruction missed is transferred from the 
main memory into the secondary cache and then into 
the primary instruction cache, or from the secondary 
cache to the primary instruction cache, where it occu- 
pies one line of the instruction cache memory. Because 
instructions are only executed out of the instruction 
cache, all instructions ultimately undergo the following 
procedure. 

[0041] At the time a frame is transferred into the
instruction cache, each instruction word in that frame is
predecoded by the predecoder 30 (Figure 1), which, as is
explained below, decodes the retrieved instruction into
a full 64 bit word. As part of this predecoding the S bit
of each instruction is expanded to a full 3 bit field (000,
001, ..., 111), which provides the explicit binary group
number of the instruction. In other words, the
predecoder, by expanding the S bit to a three bit sequence,
explicitly provides information that instruction group
000 must execute before instruction group 010, even
though all instructions in both groups have their S bits
set to 0. Because of the frame rules for sequencing
groups, these group numbers correspond to the order
of issue of the groups of instructions. Group 0 (000)
will be issued first, Group 1 (001), if present, will be
issued second, and Group 2 (010), if present, will be
issued third. Ultimately, Group 7 (111), if present, will be
issued last. At the time of predecoding of the
instructions, the S value of the last word in the frame, which
belongs to the last group in the frame to issue, is stored
in the tag field for that line in the cache, along with the
19 bit real address and a valid bit. The valid bit is a bit
that specifies whether the information in that line in the
cache is valid. If the bit is not set to "valid", there cannot
be a match or "hit" on this address. The S value from
the last instruction, which is stored in the tag
field of the line in the cache, provides a "countdown"
value that can be used to know when to increment to the
next cache line.
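
The expansion of the single S bit into an explicit group number can be sketched in a few lines. The following Python fragment is an illustration under the frame rules stated above, not the predecoder circuit itself; the function name `expand_group_numbers` is hypothetical.

```python
def expand_group_numbers(s_bits):
    """Convert the toggling S bits of one frame into explicit group numbers.

    A new group starts wherever the S bit differs from that of the
    preceding word, so the group number is a count of the toggles seen.
    """
    numbers = []
    group = 0
    for index, bit in enumerate(s_bits):
        if index > 0 and bit != s_bits[index - 1]:
            group += 1
        numbers.append(group)
    return numbers


groups = expand_group_numbers([0, 0, 1, 1, 1, 0, 0, 0])
print([format(g, "03b") for g in groups])
# ['000', '000', '001', '001', '001', '010', '010', '010']
```

In this example the first and third groups both carry S = 0 in memory, yet receive the distinct group numbers 000 and 010 after expansion. The last entry of the returned list corresponds to the value stored in the tag field for the line.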

[0042] As another part of the predecoding process, a
new 4 bit field prefix is added to each instruction giving
the explicit pipe number of the pipeline to which that
instruction will be routed. The use of four bits, rather than
three, allows the system to be later expanded with
additional pipelines. Thus, at the time an instruction is
supplied from the predecoder to the instruction cache, each
instruction will have the format shown in Figure 6. As
shown by Figure 6, bits 0 to 56 provide 57 bits for the
instruction, bits 57, 58 and 59 form the full 3 bit S field,
and bits 60-63 provide the 4 bit P field.
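
The field positions of Figure 6 can be made concrete with a small packing routine. This Python sketch is illustrative only; the names `pack_word` and `unpack_word` are hypothetical, and how the 57 instruction bits are derived from the original 32 bit instruction is not modeled here.

```python
INSTRUCTION_BITS = 57   # bits 0 to 56
S_SHIFT = 57            # bits 57 to 59 hold the 3 bit S field
P_SHIFT = 60            # bits 60 to 63 hold the 4 bit P field


def pack_word(instruction, group, pipe):
    """Assemble a 64 bit predecoded word from its three fields."""
    if not 0 <= instruction < (1 << INSTRUCTION_BITS):
        raise ValueError("instruction must fit in 57 bits")
    if not 0 <= group < 8:
        raise ValueError("group must fit in 3 bits")
    if not 0 <= pipe < 16:
        raise ValueError("pipe must fit in 4 bits")
    return (pipe << P_SHIFT) | (group << S_SHIFT) | instruction


def unpack_word(word):
    """Return (instruction, group, pipe) from a 64 bit predecoded word."""
    instruction = word & ((1 << INSTRUCTION_BITS) - 1)
    group = (word >> S_SHIFT) & 0x7
    pipe = (word >> P_SHIFT) & 0xF
    return instruction, group, pipe
```
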
[0043] Figure 7 illustrates the operation of the
predecoder in transferring a frame from memory to the
instruction cache. In the upper portion of Figure 7, the frame is
shown with a hypothetical four groups of instructions.
The first group consists of a single instruction, the
second group of three instructions, and each of the third
and fourth groups of two instructions. As described, each
instruction is 32 bits in length and includes an S bit to
separate the groups. The predecoder decodes the
instructions shown in the upper portion of Figure 7 into the
instructions shown in the lower portion of Figure 7. As
shown, the instructions are expanded to 64 bit length,
with each instruction including a 4 bit identification of the
pipeline to which the instruction is to be assigned, and
the expanded group field to designate the groups of
instructions that can be executed together. For illustration,
hypothetical pipeline tags have been applied.
Additionally, the predecoder examines each frame for the
minimum number of clocks required to execute the frame,
and that number is appended to the address tag 45 for
the line. The address tag consists of bits provided for
the real address for the line, 1 bit to designate the validity
of the frame, and 3 bits to specify the minimum time, in
number of clock cycles, for that frame to issue. The
number of clocks for the frame to issue is determined
by the group identification number of the last word in the
frame. At this stage, the entire frame shown in the lower
portion of Figure 7 is present in the instruction cache.
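
The contents of the address tag can be sketched as a simple record. The following Python fragment is hypothetical throughout: the names `LineTag`, `make_line_tag` and `clocks_to_issue` are illustrative, and the conversion from the stored group number to a clock count assumes that exactly one group issues per clock, which is an assumption of this sketch.

```python
from typing import NamedTuple


class LineTag(NamedTuple):
    real_address: int  # real address bits for the line
    valid: bool        # 1 bit designating the validity of the frame
    last_group: int    # 3 bit group number of the last word in the frame


def make_line_tag(real_address, group_numbers):
    """Build the tag for a line from the group numbers of its words."""
    last_group = group_numbers[-1]
    if not 0 <= last_group <= 7:
        raise ValueError("group number must fit in 3 bits")
    return LineTag(real_address, True, last_group)


def clocks_to_issue(tag):
    """Assumed rule: one group per clock, groups numbered from 0."""
    return tag.last_group + 1


# Frame of Figure 7: groups of one, three, two and two instructions.
tag = make_line_tag(0x12345, [0, 1, 1, 1, 2, 2, 3, 3])
print(tag.last_group, clocks_to_issue(tag))  # 3 4
```
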
[0044] It may be desirable to implement the system of
this invention on computer systems that already are in
existence and therefore have instruction structures that
have already been defined without fields for the group
information, pipeline information, or both. In this case, in
another embodiment of this invention, the group and
pipeline information is supplied on a different clock
cycle, then combined with the instructions in the cache.
Such an approach can be achieved by adding a "no-op"
instruction with fields that identify which instructions are
in which group, and identify the pipeline for execution of
the instruction, or by supplying the information relating
to the parallel instructions in another manner. It
therefore should be appreciated that the manner in which the
data arrives at the crossbar to be processed is
somewhat arbitrary. We use the word "associated" herein to
designate the concept that the pipeline and group
identifiers are not required to have a fixed relationship to the
instruction words. That is, the pipeline and group
identifiers need not be embedded within the instructions
themselves as shown in Figure 7. Instead they may
arrive from another means, or on a different cycle.

[0045] Figure 8 is a simplified diagram illustrating the
secondary cache, the predecoder, and the instruction
cache. This drawing, as well as Figures 9, 10 and 11,
is used to explain the manner in which the instructions
tagged with the P and S fields are routed to their
designated instruction pipelines.

[0046] In Figure 8 instruction frames are fetched in a
single transfer across a 256 bit (32 byte) wide path from
a secondary cache 50 into the predecoder 60. As
explained above, the predecoder expands each 32 bit
instruction in the frame to its full 64 bit wide form and
prefixes the P and S fields. After predecoding, the 512 bit
wide frame is transferred into the primary instruction
cache 70. At the same time, a tag is placed into the
tag field 74 for that line.

[0047] The instruction cache operates as a
conventional physically-addressed instruction cache. In the
example depicted in Figure 8, the instruction cache will
contain 512 bit fully-expanded instruction frames of
eight instructions each, organized in two compartments
of 256 lines.

[0048] Address sources for the instruction cache arrive
at a multiplexer 80 that selects the next address to
be fetched. Because instructions are always machine
words, the low order two address bits <1:0> of the 32
bit address field supplied to multiplexer 80 are discarded.
These two bits designate byte and half-word boundaries.
Of the remaining 30 bits, the next three low order
address bits <4:2>, which designate a particular instruction
word in a frame, are sent directly via bus 81 to the
associative crossbar (explained in conjunction with subsequent
figures). The next low eight address bits <12:5>
are supplied over bus 82 to the instruction cache 70
where they are used to select one of the 256 lines in the
instruction cache. Finally, the remaining 19 bits of the
virtual address <31:13> are sent to the translation lookaside
buffer (TLB) 90. The TLB translates these bits into
the high 19 bits of the physical address. The TLB then
supplies them over bus 84 to the instruction cache. In
the cache they are compared with the tag of the selected
line, to determine if there is a "hit" or a "miss" in the
instruction cache.
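
By way of illustration only, the address partitioning described above can be sketched in a few lines of Python. The names `FetchAddress` and `split_fetch_address` are hypothetical and do not appear in the specification; only the bit positions are taken from the text.

```python
from typing import NamedTuple


class FetchAddress(NamedTuple):
    word_index: int   # bits <4:2>: instruction word within the frame (bus 81)
    line_index: int   # bits <12:5>: one of the 256 cache lines (bus 82)
    virtual_tag: int  # bits <31:13>: 19 bits sent to the TLB


def split_fetch_address(address: int) -> FetchAddress:
    """Split a 32 bit fetch address into the three fields named in the text."""
    if not 0 <= address < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    # Bits <1:0> are discarded because instructions are whole machine words.
    word_index = (address >> 2) & 0x7        # 3 bits
    line_index = (address >> 5) & 0xFF       # 8 bits
    virtual_tag = (address >> 13) & 0x7FFFF  # 19 bits
    return FetchAddress(word_index, line_index, virtual_tag)
```

A caller would pass the result of `split_fetch_address(address).virtual_tag` to a TLB model to obtain the 19 high order physical address bits; that translation step is not modeled here.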

[0049] If there is a hit in the instruction cache, indicating
that the addressed instruction is present in the
cache, then the selected frame containing the addressed
instruction is transferred across the 512 bit wide
bus 73 into the associative crossbar 100. The associative
crossbar 100 then dispatches the addressed instruction,
with the other instructions in its group, if any,
to the appropriate pipelines over buses 110, 111, ..., 117.
Preferably the bit lines from the memory cells containing
the bits of the instruction are themselves coupled to the
associative crossbar. This eliminates the need for numerous
sense amplifiers, and allows the crossbar to operate
on the lower voltage swing information from the
cache line directly, without the normally intervening driver
circuit to slow system operation.

[0050] Figure 9 is a block diagram illustrating in more
detail the frame selection process. As shown, bits <4:2>
of the virtual address are supplied directly to the associative
crossbar 100 over bus 81. Bus 81, as explained
above, will preferably include a pair of conductors,
the bit lines, for each data bit in the field. Bits <12:5>
supplied over bus 82 are used to select a line in the
instruction cache. The remaining 19 bits, translated into
the 19 high order bits <31:13> of physical address, are
used to compare against the tags of the two selected
lines (one from each compartment of the cache) to determine
if there is a hit in either compartment. If there is
a hit, the two 512 bit wide frames are supplied to multiplexer
120. The choice of which line is ultimately supplied
to associative crossbar 100 depends upon the real
address bits <31:13> that are compared by comparators
125. The output from comparators 125 thus selects the
appropriate frame for transfer to the crossbar 100.
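
The frame selection just described behaves like a lookup in a two-compartment cache. The following Python sketch is illustrative only: the class `Compartment`, the function `select_frame`, and the use of a 64 byte `bytes` object to stand in for a 512 bit frame are assumptions made for the example, not details of the described hardware.

```python
from typing import List, Optional, Sequence

LINES_PER_COMPARTMENT = 256


class Compartment:
    """One of the two cache compartments: a tag and a frame for every line."""

    def __init__(self) -> None:
        self.tags: List[Optional[int]] = [None] * LINES_PER_COMPARTMENT
        self.frames: List[Optional[bytes]] = [None] * LINES_PER_COMPARTMENT

    def fill(self, line_index: int, physical_tag: int, frame: bytes) -> None:
        """Store a frame and its 19 bit physical tag in the given line."""
        self.tags[line_index] = physical_tag
        self.frames[line_index] = frame


def select_frame(compartments: Sequence[Compartment],
                 line_index: int,
                 physical_tag: int) -> Optional[bytes]:
    """Return the frame whose tag matches (a hit), or None on a miss."""
    for compartment in compartments:
        # Plays the role of one comparator 125 per compartment.
        if compartment.tags[line_index] == physical_tag:
            return compartment.frames[line_index]
    return None
```

In hardware both compartments are read and compared at the same time; the loop above only models the outcome of that comparison, not its timing.
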
[0051] Figure 10 illustrates in more detail the group
select function of the associative crossbar. A 512 bit
wide register 130, preferably formed by the SRAM cells
in the instruction cache, contains the frame of the instructions
to be issued. For the purposes of illustration, register
130 is shown as containing a frame having three
groups of instructions, with Group 0 including words
W0, W1 and W2; Group 1 containing words W3, W4 and
W5; and Group 2 containing words W6 and W7. For illustration,
the instructions in Group 0 are to be dispatched
to pipelines 1, 2 and 3; the instructions in Group
1 to pipelines 1, 3 and 6; and the instructions in Group
2 to pipelines 1 and 6. The three S bits (group identification
field) of each instruction in the frame are brought
out to an 8:1 multiplexer 140 over buses 131, 132,
133, ..., 138. The S field of the next group of instructions
to be executed is present in a 3 bit register 145. As
shown in Figure 10, the hypothetical contents of register
145 are 011. These bits have been loaded into register
145 using bus 81 described in conjunction with Figure
9. Multiplexer 140 then compares the value in this register
against the contents of the S field in each of the
instruction words. If the two values match, the appropriate
decoder 150 is enabled, permitting the instruction
word to be processed on that clock cycle. If the values
do not match, the decoder is disabled and the instruction
words are not processed on that clock cycle. In the example
depicted in Figure 10, the contents of register 145
match the S field of the Group 1 instructions. The resulting
output, supplied over bus 142, is communicated to
S register 144 and then to the decoders via bus 146.
The S register contents enable decoders 153, 154 and
155, all of which are in Group 001. As will be shown in
Figure 11, this will enable these instructions W3, W4 and
W5 to be sent to the pipelines for processing.
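
The group select function can be illustrated with the following Python sketch. It rests on one reading of the Figure 10 example, which is an assumption: the 011 held in register 145 is treated as the index of the addressed word (W3), the multiplexer picks out that word's S field (001), and every decoder whose word carries the same S field is enabled. The function names are hypothetical.

```python
from typing import List, Sequence


def addressed_group(s_fields: Sequence[int], word_index: int) -> int:
    """Model of the 8:1 multiplexer: return the S field of the addressed word."""
    return s_fields[word_index]


def enable_decoders(s_fields: Sequence[int], current_group: int) -> List[bool]:
    """Enable one decoder per word whose S field equals the current group."""
    return [s_field == current_group for s_field in s_fields]


# S fields of words W0..W7 in the Figure 10 example (Groups 0, 1 and 2).
FIGURE_10_S_FIELDS = [0, 0, 0, 1, 1, 1, 2, 2]
```

With `FIGURE_10_S_FIELDS` and a word index of `0b011`, `addressed_group` yields group 1 and `enable_decoders` enables only the entries for W3, W4 and W5, in agreement with the decoders 153, 154 and 155 named in the text.
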
[0052] Figure 11 is a block diagram illustrating the
group dispatching of the instructions in the group to be
executed. The same registers are shown across the upper
portion of Figure 11 as in the lower portion of Figure
10. As shown in Figure 11, the crossbar switch itself consists
of two sets of crossing pathways. In the horizontal
direction are the pipeline pathways 180, 181, ..., 187.
In the vertical direction are the instruction word paths 190,
191, ..., 197. Each of these pipeline and instruction pathways
is itself a bus for transferring the instruction
word. Each horizontal pipeline pathway is coupled to a
pipeline execution unit 200, 201, 202, ..., 207. Each of the
vertical instruction word pathways 190, 191, ..., 197 is
coupled to an appropriate portion of register 130 (Figure
10).

[0053] The decoders 170, 171, ..., 177 associated
with each instruction word pathway receive the 4 bit
pipeline code from the instruction. Each decoder, for example
decoder 170, provides as output eight 1 bit control
lines. One of these control lines is associated with
each pipeline pathway crossing of that instruction word
pathway. Selection of a decoder as described with reference
to Figure 10 activates the output bit control line
corresponding to that input pipe number. This signals
the crossbar to close the switch between the word path
associated with that decoder and the pipe path selected
by that bit line. Establishing the cross connection between
these two pathways causes a selected instruction
word to flow into the selected pipeline. For example, decoder
173 has received the pipeline bits for word W3.
Word W3 has associated with it pipeline path 1. The
pipeline path 1 bits are decoded to activate switch 213
to supply instruction word W3 to pipeline execution unit
201 over pipeline path 181. In a similar manner, the identification
of pipeline path 3 for decoder D4 activates
switch 234 to supply instruction word W4 to pipeline path
3. Finally, the identification of pipeline 6 for word W5 in
decoder D5 activates switch 265 to transfer instruction
word W5 to pipeline execution unit 206 over pipeline
pathway 186. Thus, instructions W3, W4, and W5 are
executed by pipes 201, 203 and 206, respectively.
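
As a minimal sketch of the dispatch step, assuming eight pipelines numbered 0 to 7, the decoder and crossbar behavior can be modeled in Python as follows. The names `decode_pipeline_code` and `dispatch` are hypothetical, and the check that two enabled words never target the same pipeline is an assumption about what the compiler guarantees.

```python
from typing import Dict, List, Sequence

NUM_PIPES = 8


def decode_pipeline_code(pipe: int) -> List[int]:
    """Turn a pipeline code into eight 1 bit control lines, exactly one active."""
    if not 0 <= pipe < NUM_PIPES:
        raise ValueError("pipeline code out of range")
    return [1 if line == pipe else 0 for line in range(NUM_PIPES)]


def dispatch(words: Sequence[str],
             pipe_codes: Sequence[int],
             enabled: Sequence[bool]) -> Dict[int, str]:
    """Close one crossbar switch per enabled word; return pipe -> word."""
    routed: Dict[int, str] = {}
    for word, pipe, is_enabled in zip(words, pipe_codes, enabled):
        if not is_enabled:
            continue  # decoder disabled: no control line is driven
        control_lines = decode_pipeline_code(pipe)
        target = control_lines.index(1)
        if target in routed:
            # Assumed never to happen: the compiler assigns distinct pipes.
            raise ValueError("two words in one group target the same pipeline")
        routed[target] = word
    return routed
```

Using the Figure 10 values (pipeline codes 1, 2, 3, 1, 3, 6, 1, 6 for W0 to W7, with only the Group 1 decoders enabled), `dispatch` routes W3 to pipe 1, W4 to pipe 3 and W5 to pipe 6, matching the switches 213, 234 and 265 described above.
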
[0054] The pipeline processing units 200, 201, ...,
207 shown in Figure 11 can carry out desired operations.
In a preferred embodiment of the invention, each
of the eight pipelines first includes a sense amplifier to
detect the state of the signals on the bit lines. In one
embodiment the pipelines include first and second arithmetic
logic units; first and second floating point units;
first and second load units; a store unit and a control
unit. The particular pipeline to which a given instruction
word is dispatched will depend upon hardware constraints
as well as data dependencies.
[0055] Figure 12 is an example of a frame and how it
will be executed by the pipeline processors 200-207 of 
Figure 11. As shown in Figure 12, the frame includes 
three groups of instructions. The first group, with group 
identification number 0, includes two instructions that 
can be executed by the arithmetic logic unit, a load in- 
struction and a store instruction. Because all these in- 
structions have been assigned the same group identifi- 
cation number by the compiler, all four instructions can 
execute in parallel. The second group of instructions 
consists of a single load instruction and two floating 
point instructions. Again, because each of these instruc- 
tions has been assigned "Group 1", all three instructions 
can be executed in parallel. Finally, the last instruction 
word in the frame is a branch instruction that, based up- 
on the compiler's decision, must be executed last. 
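
The frame of Figure 12 can be written down as data, purely for illustration. In the Python sketch below the operation names and the order of instructions inside each group are assumptions; only the group sizes, the group numbers and the position of the branch at the end of the frame come from the text.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Instruction = Tuple[str, int]  # (operation name, group identification number)

# Hypothetical rendering of the Figure 12 frame: 4 + 3 + 1 = 8 instructions.
FIGURE_12_FRAME: List[Instruction] = [
    ("alu", 0), ("alu", 0), ("load", 0), ("store", 0),
    ("load", 1), ("fp", 1), ("fp", 1),
    ("branch", 2),
]


def issue_order(frame: Sequence[Instruction]) -> List[List[str]]:
    """Return one list of operations per clock, in left-to-right group order."""
    return [
        [operation for operation, _ in members]
        for _, members in groupby(frame, key=lambda instruction: instruction[1])
    ]
```

Each inner list returned by `issue_order(FIGURE_12_FRAME)` corresponds to one group of instructions that the compiler has marked as safe to execute in parallel.
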
[0056] Figure 13 illustrates the execution of the in- 
structions in the frame shown in Figure 12. As shown, 
during the first clock the Group 0 instructions execute, 
during the second clock the load and floating point in- 
structions execute, and during the third clock the branch 
instruction executes. To prevent groups from being split
across two instruction frames, an instruction frame may 
be only partially filled, where the last group is too large 
to fit entirely within the remaining space of the frame. 
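
The rule that a group is never split across two frames amounts to a simple packing procedure at compile time. The following Python sketch shows one way such packing could be done; the greedy strategy and the name `pack_groups` are assumptions for illustration and are not taken from the specification.

```python
from typing import List, Sequence

FRAME_SLOTS = 8  # instructions per frame


def pack_groups(groups: Sequence[Sequence[str]]) -> List[List[List[str]]]:
    """Pack groups into frames in order, never splitting a group."""
    frames: List[List[List[str]]] = []
    current: List[List[str]] = []
    used = 0
    for group in groups:
        size = len(group)
        if size == 0 or size > FRAME_SLOTS:
            raise ValueError("a group must hold between 1 and 8 instructions")
        if used + size > FRAME_SLOTS:
            # The group does not fit: leave the frame partially filled.
            frames.append(current)
            current, used = [], 0
        current.append(list(group))
        used += size
    if current:
        frames.append(current)
    return frames
```

For groups of four, three, two and one instructions, the first frame holds the first two groups with one slot left empty, and the remaining two groups start a second frame.
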



[0057] Figure 14 is a diagram illustrating another embodiment
of the associative crossbar. In Figure 14 nine
pipelines 0 - 8 are shown coupled to the crossbar. The
three bit program counter PC, which points to one of the
instructions in the frame, is used in combination with the
set of 8 group identification bits for the frame, which
indicate the group affiliation of each instruction, to enable
a subset of the instructions in the frame. The enabled
instructions are those at or above the address indicated
by the PC that belong to the current group.

[0058] The execution ports that connect to the pipelines
specified by the pipeline identification bits of the
enabled instructions are then selected to multiplex out
the appropriate instructions from the current frame. If
one or more of the pipelines is not ready to receive a
new instruction, a set of hold latches at the output of
the execution ports prevents any of the enabled instructions
from issuing until the "busy" pipeline is free. Otherwise
the instructions pass transparently through the
hold latches into their respective pipelines. Accompanying
the output of each port is a "port valid" signal that
indicates whether the port has valid information to issue
to the hold latch.
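
The all-or-nothing behavior of the hold latches can be sketched as follows. This Python example is illustrative only: the function name `issue_through_latches` and the representation of enabled instructions as a mapping from port number to instruction are assumptions.

```python
from typing import Dict, List, Set, Tuple

NUM_PORTS = 9  # Figure 14 shows pipelines 0 - 8


def issue_through_latches(enabled: Dict[int, str],
                          busy: Set[int]) -> Tuple[List[bool], Dict[int, str]]:
    """Return (port valid signals, instructions actually issued this cycle)."""
    # A port is valid when an enabled instruction has selected it.
    port_valid = [port in enabled for port in range(NUM_PORTS)]
    # If any targeted pipeline is busy, the latches hold the whole group.
    if any(port in busy for port in enabled):
        return port_valid, {}
    # Otherwise the instructions pass transparently into their pipelines.
    return port_valid, dict(enabled)
```

A busy pipeline that none of the enabled instructions targets does not stall the group in this sketch; the text is read here as referring only to pipelines that are about to receive an instruction, which is an interpretation.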

[0059] Figure 15 is a diagram illustrating the group select
function in further detail. This figure illustrates the
mechanism used to enable an addressed group of instructions
within a frame. The program counter is first
decoded into a set of 14 bit signals. Seven of these signals
are combined with the eight group identifiers of the
current frame to determine whether each of the seven
instructions, I1 to I7, is or is not the start of a later group.
This information can then be combined with the other 7
bit signals from the PC decoder to determine which of
the eight instructions in the frame should be enabled.
Using the pipeline identifying field, each such enabled
instruction can then signal the execution port, as determined
by the pipeline identifier, to multiplex out the enabled
instruction. Thus, if I2 is enabled, and the pipeline code is 5,
the select line from I2 to port 5 is activated, causing I2
to flow to the hold latch at pipe 5.
[0060] Because the instructions that start later groups
are known, the system can decide easily which instruction
starts the next group. This information is used to
update the PC to the address of the next group of instructions.
If no instruction in the frame begins the next
group, i.e., the last instruction group has been dispatched
to the pipelines, a flag is set. The flag causes
the next frame of instructions to be brought into the
crossbar. The PC is then reset to I0. Shown in the figure
is an exemplary sequence of the values that the PC, the
instruction enable bits and the next frame flag take on
over a sequence of eight clocks extending over two
frames.
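
The sequencing of the PC, the instruction enable bits and the next frame flag can be illustrated with a short Python sketch. It assumes that the instructions of a group occupy adjacent positions in the frame and that each instruction carries a group number; the function name `step` and the example values are hypothetical and are not the sequence shown in the figure.

```python
from typing import List, Sequence, Tuple


def step(group_ids: Sequence[int], pc: int) -> Tuple[List[bool], int, bool]:
    """Return (enable bits, next PC, next frame flag) for one clock."""
    current = group_ids[pc]
    # Enabled: instructions at or above the PC that belong to the current group.
    enabled = [index >= pc and group == current
               for index, group in enumerate(group_ids)]
    # Instructions after the PC that belong to a later group.
    later = [index for index in range(pc + 1, len(group_ids))
             if group_ids[index] != current]
    if later:
        return enabled, later[0], False  # PC moves to the start of the next group
    return enabled, 0, True              # last group issued: fetch the next frame
```

For a frame with group numbers 0, 0, 0, 1, 1, 1, 2, 2, successive calls starting at a PC of 0 move the PC to 3, then to 6, and then back to 0 with the next frame flag set.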

[0061] The processor architecture described above
provides many unique advantages to a system using
this invention. The system described is extremely flexible,
enabling instructions to be executed sequentially or
in parallel, depending entirely upon the "intelligence" of
the compiler. As compiler technology improves, the described
hardware can execute programs more rapidly,
not being limited to any particular frame width, number
of instructions capable of parallel execution, or other external
constraints. Importantly, the associative crossbar
aspect of this invention relies upon the content of the
message being decoded, not upon an external control
circuit acting independently of the instructions being executed.
In essence, the associative crossbar is self-directed.
In the preferred embodiment the system is capable
of a parallel issue of up to eight operations per
cycle. For a more complete description of the associative
crossbar, see EP-A-0 652 509.

[0062] Although the foregoing has been a description 
of the preferred embodiment of the invention, it will be 
apparent to those of skill in the art that numerous mod- 
ifications and variations may be made to the invention 
without departing from the scope as described therein. 
For example, arbitrary numbers of pipelines, arbitrary 
numbers of decoders, and different architectures may 
be employed, yet rely upon the system we have devel- 
oped. 



Claims 

1. A method for operating a computing system com- 
prising: 

fetching a first frame of instructions from an in- 
struction memory, the first frame of instructions 
comprising up to eight instructions, each in- 
struction comprising instruction data and a 
grouping bit, wherein grouping bits of the in- 
structions are indicative of groups of instruc- 
tions from the first frame of instructions, where- 
in the groups of instructions comprises at least 
one group of instructions and comprises at 
most eight groups of individual instructions, 
wherein a group of instructions are issued sep- 
arately from other groups of instructions within 
the first frame of instructions, wherein groups 
of instructions are issued from left-to-right from 
the first frame of instructions, and wherein in- 
structions within a group of instructions are is- 
sued in parallel; 

issuing instructions in a first group of instruc- 
tions from the first frame of instructions to func- 
tional units appropriate for the instructions in 
the first group of instructions in response to 
grouping bits of the instructions in the first
group of instructions and in response to a map- 
ping of the instructions in the first group of in- 
structions to functional units, wherein the map- 
ping is determined in response to at least a por- 



tion of instruction data in each instruction in the 
first group of instructions from the first frame of 
instructions; 

wherein the groups of instructions from the

first frame of instructions are specified at compile 
time. 

2. The method of claim 1 comprising:

fetching a second frame of instructions from the
instruction memory, the second frame of instructions
comprising up to eight instructions,
each instruction comprising instruction data
and a grouping bit, wherein grouping bits of the
instructions are indicative of groups of instructions
from the second frame of instructions,
wherein the groups of instructions comprises at
least one group of instructions and comprises
at most eight groups of individual instructions,
wherein a group of instructions are issued separately
from other groups of instructions within
the second frame of instructions, wherein
groups of instructions are issued from left-to-right
from the second frame of instructions, and
wherein instructions within a group of instructions
are issued in parallel; and

issuing instructions in a first group of instructions
from the second frame of instructions to
functional units appropriate for the instructions
in the first group of instructions in response to
grouping bits of the instructions in the first
group of instructions and in response to a mapping
of the instructions in the first group of instructions
to functional units, wherein the mapping
is determined in response to at least a portion
of instruction data in each instruction in the
first group of instructions from the second frame
of instructions;

wherein instructions from the first frame of instructions
are issued before fetching the second
frame of instructions from the instruction memory;
and

wherein the groups of instructions from the
second frame of instructions are specified at compile
time.

3. The method of claim 2 wherein a 256 bit
wide first frame of instructions and 32 bit wide instructions
are used.

4. The method of any of the above claims wherein a
first frame of instructions with fewer than eight
instructions is used.

5. The method of any of the above claims further comprising:

issuing instructions in a second group of instructions
from the first frame of instructions to
functional units appropriate for the instructions 
in the second group of instructions in response 
to grouping bits of the instructions in the second 
group of instructions and in response to a map- 
ping of the instructions in the second group of 
instructions to functional units, wherein the 
mapping is determined in response to at least 
a portion of instruction data in each instruction 
in the second group of instructions from the first 
frame of instructions. 

6. The method of claim 5 wherein a functional unit appropriate
for an instruction in the second group of
instructions from the first frame of instructions comprises
a floating point unit; and wherein the second
group of instructions from the first frame of instructions
includes a floating point instruction.

7. The method of any of the above claims
wherein a first frame of instructions with exactly 
eight 32 bit-wide instructions is used. 

8. The method of any of the above claims wherein 
three or fewer groups of instructions are packed 
within the first frame of instructions thereby reduc- 
ing instruction code size. 

9. The method of any of the above claims wherein is- 
suing instructions in a first group of instructions from 
the first frame of instructions further comprises de- 
coding the instructions in the first group of instruc- 
tions before issuance to functional units appropriate 
for the instructions in the first group of instructions 
from the first frame of instructions. 

10. The method of any of the above claims wherein 
groups of instructions are not split across the first 
frame of instructions and the second frame of in- 
structions. 

11. The method of any of the above claims wherein the
groups of instructions from the first frame of instruc- 
tions and the groups of instructions from the second 
frame of instructions are specified at compile time 
in response to data dependency checking of the in- 
structions in the first frame of instructions and in re- 
sponse to data dependency checking of the instruc- 
tions in the second frame of instructions. 

12. The method of any of the above claims wherein the 
first group of instructions from the first frame of in- 
structions includes two add instructions that are is- 
sued in parallel to two arithmetic logic units. 

13. The method of any of the above claims comprising
including in the first group of instructions from the
first frame of instructions a load instruction that operates
upon a register.

14. The method of any of the above claims wherein a
branch instruction is used as an instruction in the
first frame of instructions.

15. A processor comprising:

a plurality of functional units;

an instruction memory configured to store a first
frame of instructions, the first frame of instructions
comprising up to eight instructions, each
instruction comprising instruction data and a
grouping bit, wherein grouping bits of the instructions
are indicative of variable length
groups of instructions from the first frame of instructions,
wherein the variable length groups
of instructions comprises at least one group of
instructions and comprises at most eight
groups of individual instructions, wherein a
group of instructions are dispatched separately
from other groups of instructions within the first
frame of instructions, wherein groups of instructions
are dispatched from left-to-right from the
first frame of instructions, and
wherein instructions within a group of instructions
are dispatched in parallel; and

an instruction dispatching unit coupled to the
instruction memory and the plurality of functional
units, the instruction dispatching unit configured to
dispatch instructions in a first group of instructions
from the first frame of instructions to functional units
appropriate for the instructions in the first group of
instructions in response to grouping bits of the instructions
in the first group of instructions and in response
to a mapping of the instructions in the first
group of instructions to functional units, wherein the
mapping is determined in response to at least a portion
of instruction data in each instruction in the first
group of instructions from the first frame of instructions;
and

wherein the groups of instructions from the
first frame of instructions are determined at compile
time.

16. The processor of claim 15 wherein the first frame of
instructions is 256 bits-wide; and wherein the instructions
in the first frame of instructions comprise
32 bit words.

17. The processor of any of the above claims wherein
the first frame of instructions comprises less than
eight 32 bit-wide instructions and a no operation.

18. The processor of any of the above claims wherein
the plurality of functional units comprises a functional
unit configured to perform floating-point operations.

19. The processor of any of the above claims wherein
the plurality of functional units comprises a functional
unit configured to perform a load operation and a
store operation.

20. The processor of any of the above claims wherein
the plurality of functional units comprises a functional
unit configured to perform branch instructions.

21. The processor of any of the above claims wherein
at least three groups of instructions are packed
within the first frame of instructions thereby saving
memory space.

22. The processor of any of the above claims wherein
the instruction dispatching unit is also configured to
fetch the first frame of instructions and configured
to decode the instructions in the first group of instructions
before dispatch to functional units appropriate
for the instructions in the first group of instructions
from the first frame of instructions.

23. The processor of any of the above claims wherein
groups of instructions to be dispatched in parallel
are not split across frames of instructions.

24. The processor of any of the above claims wherein
at least four groups of instructions are packed within
the first frame of instructions thereby reducing
memory bandwidth.

25. The processor of any of the above claims wherein
the groups of instructions from the first frame of instructions
are specified at compile time by the
grouping bits of the instructions in the first frame of
instructions in response to data dependency checking
of the instructions in the first frame of instructions.

26. The processor of any of the above claims wherein
the instructions in the first instruction frame are
stored in little-endian format.

27. The processor of any of the above claims wherein
the plurality of functional units include functional
units configured to perform arithmetic and logic operations.



28. An apparatus comprises:

a memory configured to store a plurality of instruction
data; and

a processor coupled to the memory comprising:

a memory controller configured to receive
a first set of instruction data from the memory;

a plurality of functional units;

an instruction memory configured to store
a first frame of instructions in response to
the first set of instruction data from the
memory, the first frame of instructions comprising
up to eight instructions, each instruction
comprising instruction data and a
grouping bit, wherein grouping bits of the
instructions are indicative of groups of instructions
of variable lengths from the first
frame of instructions, wherein the groups
of instructions comprises at least one
group of instructions and comprises at
most eight groups of instructions, wherein
a number of instructions in the groups of
instructions comprises at least one instruction
and up to eight instructions; wherein a
group of instructions are dispatched separately
from other groups of instructions
within the first frame of instructions, wherein
groups of instructions are dispatched
from left-to-right from the first frame of instructions,
and wherein instructions within
a group of instructions are dispatched in
parallel; and

an instruction dispatching unit coupled to
the instruction memory and coupled to the
plurality of functional units, the instruction
dispatching unit configured to fetch the first
frame of instructions from the instruction
memory and configured to dispatch instructions
in a first group of instructions
from the first frame of instructions to functional
units appropriate for the instructions
in the first group of instructions in response
to grouping bits of the instructions in the
first group of instructions and in response
to a mapping of the instructions in the first
group of instructions to appropriate functional
units, wherein the mapping is determined
in response to at least a portion of
instruction data in each instruction in the
first group of instructions from the first
frame of instructions;

wherein the groups of instructions from the
first frame of instructions are determined at compile
time.



29. The apparatus of claim 28 wherein the first frame
of instruction comprises 256 bits and the instructions
in the first frame of instructions comprises 32
bits.

30. The apparatus of any of the above claims

wherein the memory controller is configured
to receive a second set of instruction data from the
memory;

wherein the instruction memory is also con- 
figured to store a second frame of instructions in re- 
sponse to the second set of instruction data from 
the memory, the second frame of instructions com- 
prising up to eight instructions, each instruction 
comprising instruction data and a grouping bit, 
wherein grouping bits of the instructions are indic- 
ative of groups of instructions of variable lengths 
from the second frame of instructions, wherein the 
groups of instructions comprises at least one group 
of instructions and comprises at most eight groups 
of instructions, wherein a number of instructions in
the groups of instructions comprises at least one in- 
struction and up to eight instructions; wherein a 
group of instructions are dispatched separately 
from other groups of instructions within the second 
frame of instructions, wherein groups of instructions 
are dispatched from left-to-right from the first frame 
of instructions, and wherein instructions within a 
group of instructions are dispatched in parallel; and 

wherein the instruction dispatching unit is also 
configured to fetch the second frame of instructions 
from the instruction memory and configured to dis- 
patch instructions in a first group of instructions from 
the second frame of instructions to functional units 
appropriate for the instructions in the first group of 
instructions in response to grouping bits of the in- 
structions in the first group of instructions and in re- 
sponse to a mapping of the instructions in the first 
group of instructions to appropriate functional units, 
wherein the mapping is determined in response to 
at least a portion of instruction data in each instruc- 
tion in the first group of instructions from the second 
frame of instructions; 

wherein the first group of instructions from the 
first frame of instructions and the first group of in- 
structions in the second wide frame of instructions 
are determined at compile time. 

31. The apparatus of any of the above claims wherein 
the first frame of instructions comprises a no oper- 
ation instruction. 

32. The apparatus of any of the above claims wherein
the plurality of functional units comprises eight functional
units including a floating-point unit.

33. The apparatus of any of the above claims wherein 
the plurality of functional units also includes a func- 
tional unit configured to perform a load operation. 

34. The apparatus of any of the above claims wherein 
more than two groups of instructions are packed 
within the first frame of instructions thereby reduc- 
ing instruction code size. 



35. The apparatus of any of the above claims wherein
the instruction dispatching unit is also configured to
decode the instructions in the first group of instructions
before dispatch to the appropriate functional
units.

36. The apparatus of any of the above claims wherein
the groups of instructions from the first frame of instructions
are specified at compile time by the
grouping bits of the instructions in the first frame of
instructions in response to data dependency checking
of the instructions in the first frame of instructions.

37. The apparatus of any of the above claims

wherein the groups of instructions from the
first frame of instructions comprises the first group
of instructions, a second group of instructions, and
a third group of instructions; and

wherein a number of instructions in the first
group of instructions is different from a number of
instructions in the second group of instructions.

38. The apparatus of any of the above claims

wherein the groups of instructions from the
first frame of instructions comprise four groups of
instructions; and

wherein a number of instructions in the first
group of instructions is different from a number of
instructions in a fourth group of instructions.

39. The apparatus of any of the above claims

wherein a number of instructions in the first
group of instructions is equal to a number of instructions
in the second group of instructions.

40. The apparatus of any of the above claims wherein
the number of instructions in the first group of instructions
is exactly one instruction.